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Abstract — A collaborative convex framework for factoring a 
data matrix X into a non-negative product AS, with a sparse 
coefficient matrix S, is proposed. We restrict the columns of 
the dictionary matrix A to coincide with certain columns of the 
data matrix X, thereby guaranteeing a physically meaningful 
dictionary and dimensionality reduction. We use /l oo regular- 
ization to select the dictionary from the data and show this 
leads to an exact convex relaxation of lo in the case of distinct 
noise free data. We also show how to relax the restriction-to-X 
constraint by initializing an alternating minimization approach 
with the solution of the convex model, obtaining a dictionary 
close to but not necessarily in X. We focus on applications 
of the proposed framework to hyperspectral endmember and 
abundances identification and also show an application to blind 
source separation of NMR data. 

Index Terms — Non-negative matrix factorization, dictionary 
learning, subset selection, dimensionality reduction, hyperspec- 
tral endmember detection, blind source separation 

I. Introduction 

DIMENSIONALITY reduction has been widely smdied 
in the signal processing and computational learning 
communities. One of the major drawbacks of virtually all 
popular approaches for dimensionality reduction is the lack 
of physical meaning in the reduced dimension space. This 
significantly reduces the applicability of such methods. In this 
work we present a framework for dimensionality reduction, 
based on matrix factorization and sparsity theory, that uses the 
data itself (or small variations from it) for the low dimensional 
representation, thereby guaranteeing the physical fidelity. We 
propose a new convex method to factor a non-negative data 
matrix X into a product AS, for which S is non-negative and 
sparse and the columns of A coincide with columns from the 
data matrix X. 

The organization of this paper is as follows. In the remainder 
of the introduction, we further explain the problem, summarize 
our approach and discuss applications and related work. In 
Section |ll] we present our proposed convex model for end- 
member (dictionary) computation that uses /i oo regularization 
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to select as endmembers a sparse subset of columns of X, 
such that sparse non-negative linear combinations of them are 



capable of representing all other columns. Section III shows 
that in the case of distinct noise free data, Zi oo regularization 
is an exact relaxation of the ideal row-0 norm (number of 
non-zero rows), and furthermore proves the stability of our 



method in the noisy case. Section IV presents numerical results 



for both synthetic and real hyperspectral data. In Section |V] 
we present an extension of our convex endmember detection 
model that is better able to handle outliers in the data. We 
discuss its numerical optimization, compare its performance to 
the basic model and also demonstrate its application to a blind 
source separation (BSS) problem based on NMR spectroscopy 
data. 



A. Summary of the problem and geometric interpretation 

The underlying general problem of representing X w AS 
with A, 5 > is known as non-negative matrix factorization 
(NMF). Variational models for solving NMF problems are 
typically non-convex and are solved by estimating A and S 
alternatingly. Although variants of alternating minimization 
methods for NMF often produce good results in practice, they 
are not guaranteed to converge to a global minimum. 

The problem can be greatly simplified by assuming a partial 
orthogonality condition on the matrix S as is done in |1|, 
111. More precisely, the assumption is that for each row i 
of S there exists some column j such that Si,j > and 
Sk.j — for k ^ i. Under this assumption, NMF has a 
simple geometric interpretation. Not only should the columns 
of A appear in the data X up to scaling, but the remaining 
data should be expressible as non-negative linear combinations 
of these columns. Therefore the problem of finding A is to 
find columns in X, preferably as few as possible, that span a 
cone containing the rest of the data X. Figure [T] illustrates the 
geometry in three dimensions. 

The problem we actually want to solve is more difficult 
than NMF in a couple respects. One reason is the need to 
deal with noisy data. Whereas NMF by itself is a difficult 
problem already, the identification of the vectors becomes even 
more difficult if the data X contains noise and we need to 
find a low dimensional cone that contains most of the data 
(see lower right image in Figure [TJ. Notice that in the noisy 
case, finding vectors such that all data is contained in the 
cone they span would lead to a drastic overestimation of the 
number of vectors. Arbitrarily small noise at a single data 
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point could already lead to including this vector into the set 
of cone spanning vectors. Thus, the problem is ill-posed and 
regularization is needed to handle noise. In addition to small 
noise there could also be outliers in the data, namely columns 
of X that are not close to being well represented as a non- 
negative linear combination of other columns, but that we do 
not wish to include in A. Such outliers could arise from bad 
sensor measurements, non-physical artifacts or any features 
that for some reason we are not interested in including in our 
dictionary A. Another complication that requires additional 
modeling is that for the applications we consider, the matrix 
S should also be sparse, which means we want the data to be 
represented as sparse non-negative linear combinations of the 
columns of A. 
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Fig. 1: Geometric interpretation of the endmember detection 
problem. First row: two different viewpoints for a data set 
in three dimensions, second row: same data set with the 
vectors that can express any data point as a non-negative linear 
combination in red, third row left: cone spanned by the three 
vectors containing all data, third row right: illustration of the 
cone with Gaussian noise added to the data, in this case not 
all points lie inside the cone anymore 

B. Our proposed approach 

We obtain the X — AS factorization by formulating the 
problem as finding a sparse non-negative T such that X « XT 
and as many rows of T as possible are entirely zero. We want 
to encourage this so called row sparsity of T in order to select 
as few as possible data points as dictionary atoms. We do this 
by using Ix^ao regularization. This type of regularization cou- 
ples the elements in each row of T and is based on the recent 
ideas of collaborative sparsity (see for example |I3| and the 



references therein). The particular regularization has been 
studied by Tropp in f45, however, without considering non- 
negativity constraints and also not in the setting of finding a T 
such that X w XT for physically meaningful dimensionality 
reduction. A strong motivation for using Zi oo regularization 
instead of other row sparsity regularizers like l\i is that in 
the case of distinct, noise-free data, the oo model is an exact 
relaxation of minimizing the number of non-zero rows in T 
such that X = XT. This exact relaxation is independent of 
the coherence of the columns of X. Without the non-negativity 
constraint, a low coherence is crucial as is shown in |4j. The 
general setting X « XT was proposed by Lin et al. in for 
low rank approximation with the nuclear norm. However, the 
nuclear norm does not lead to row sparsity, and thus there is no 
obvious way to extract dictionary atoms from the minimizer 
Both the nuclear norm and oo approaches are addressing 
related but different problems. Our main contribution is to 
apply the joint sparsity regularizer /i oo to the non-negative 
factorization setting X w XT and thereby implicitly select 
certain columns of X for the description of all columns of 
X. We pose this as a convex optimization problem in T. For 
practical reasons, we will need to perform some preliminary 
data reduction, explained in Section before carrying out the 
convex minimization. The main idea, however, is to minimize 
over r > the li,oo norm of T plus some convex penalty 
on X — XT. In the simplest case, we penalize \\X ~ XTWj^. 
We also propose an advanced noise model to handle the case 
where X contains outliers. Both models also incorporate a 
weighted li penalty to encourage a sparser T so that from the 
few columns of X selected to represent the whole data, only 
a few are used per sample. 

C. Applications and related work 

Although we concentrate on hyperspectral imaging (HSI) 
and briefly discuss an application to blind source separation 
(BSS), our method is applicable in numerous areas, from 
biology to sensor networks. 

For instance, one approach to text mining, that is the 
technique of extracting important information from large text 
data sets, is to reduce the number of relevant text documents 
by clustering them into content dependent categories using 
non-negative matrix factorization (see [6| for details). Our 
approach could potentially be used to not only cluster the large 
amount of documents by a similar factorization, but due to the 
dictionary being a part of the data, it would furthermore lead 
to a correspondence of every atom in the dictionary to a certain 
document. This correspondence might help a human analyzer 
judge the importance of each cluster Therefore the physical 
meaning of the dictionary atoms would have an immediate 
advantage for the further analysis. 

As another example, in |7 1 non-negative matrix factorization 
is applied to spectrograms of many different musical sounds in 
order to obtain a spectral basis for musical genre classification. 
Again the physical fidelity of our approach could be interesting 
since it would provide a correspondence of each spectral basis 
element to a certain musical sound. 

From the numerous potential applications, we concentrate 
on two to illustrate the proposed framework. We describe next 
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the challenges in these applications. This early description of 
these applications will help to further motivate the work. 

1) Introduction to hyperspectml imaging (HSI): HSI sen- 
sors record up to several hundred different frequencies in 
the visible, near-infrared and infrared spectrum. This precise 
spectral information provides some insight on the material at 
each pixel in the image. Due to relatively low spatial resolution 
and the presence of multiple materials at a single location (e.g., 
tree canopies above ground or water and water sediments), 
many pixels in the image contain the mixed spectral signatures 
of multiple materials. The task of determining the abundances 
(presence quantity) of different materials in each pixel is called 
spectral unmixing. This is clearly an ill-posed problem that 
requires some assumptions and data models. 

Unmixing requires a dictionary with the spectral signatures 
of the possible materials (often denoted as endmembers). 
Since these dictionaries can be difficult to obtain and might 
depend on the conditions under which they were recorded, 
it is sometimes desirable to automatically extract suitable 
endmembers from the image one wants to demix in a process 
called endmember detection. Many different techniques for 
endmember detection have been proposed, see O and refer- 
ences therein. Related although not yet applied to endmember 
detection are subset selection methods like the rank-revealing 
QR decomposition (e.g. ||9l, ifTOl ). which finds the most lin- 
early independent columns of a matrix Q However, unlike our 
approach, QR methods do not take non-negativity constraints 
into account. The general principle behind rank-revealing QR 
for subset selection is to find a column permutation of the data 
matrix such that the first few columns are as well conditioned 
as possible |10|. 

Simultaneously detecting endmembers and computing abun- 
dances can be stated as factoring the data matrix X G M™''' 
into X ^ AS, A,S> 0, with both A e M'"-" and S e M"''' 
being unknown. In this notation each column of X is the 
spectral signature of one pixel in the image. Hence, m is the 
number of spectral bands, d is the total number of pixels, and 
each of the n columns of A represents one endmember. The 
abundance matrix S contains the amounts of each material in 
A at each pixel in X. The application of NMF to hyperspectral 
endmember detection can for instance be found in lfT2ll . [TSl. 

Considering that while material mixtures in HSI exist, it 
is unlikely that pixels contain a mixture of all or many 
of the materials in A, researchers have recently focused on 
encouraging sparsity on the abundance matrix S |fT4l . ifTSl . 
Motivated by the ideas of dictionary learning for sparse 
coding, the works fTSl, f\n\ proposed explicitly to look for 
endmember matrices that lead to sparse abundance maps. We 
follow a similar idea, though our method will be fundamentally 
different in two aspects: First, we restrict columns of our 
dictionary A to appear somewhere in the data X. This is a 
common working hypothesis for moderate ground sampling 
distance images and is called pixel purity assumption. It corre- 
sponds to the partial orthogonality assumption on S discussed 

'independent of the work here described, Laura Balzano, Rob Nowak 
and Waheed Bajwa developed a related matrix factorization technique and 
connected and compared it to RRQR. We thank Laura for pointing out their 
work II II and also the possible relationships with RRQR. 



previously. In the general context of dictionary learning and 
non-negative matrix factorization it guarantees the columns 
of A to be physically meaningful. As mentioned above, the 
lack of physical interpretation has been a critical shortcoming 
of standard dimensionality reduction and dictionary learning 
techniques, and not yet addressed in these areas of research. 
Second, choosing the dictionary columns from the data enables 
us to propose a convex model and hence avoid the problem 
of saddle points or local minima. 

Different areas of applications use different terminology for 
mathematically similar things. Throughout this paper we will 
use the terminology of hyperspectral unmixing to explain and 
motivate our model, although its application is not exclusively 
in the field of HSI. For instance, an endmember could be any 
abstract column of the dictionary matrix A with a different 
physical meaning depending on its context. However, we think 
it aids the clarity and understanding of the model to use the 
HSI example throughout the explanations. To show that our 
model is not limited to the hyperspectral case we also present 
results for the problem of blind source separation, which we 
will briefly describe in the next subsection. 

2} Introduction to blind source separation (BSS): Blind 
source separation is the general problem of recovering un- 
known source signals from measurements of their mixtures, 
where the mixing process is unknown but often assumed to be 
linear. Examples include demixing audio signals and nuclear 
magnetic resonance (NMR) spectroscopy. Many BSS prob- 
lems have the same linear mixing model X — AS that we are 
using for hyperspectral endmember detection and unmixing, 
often also with a nonnegativity constraint on S Here the 
rows of S represent the source signals, A is the unknown 
mixing matrix and the rows of X are the measured signals. 
The analogy of the hyperspectral pixel purity assumption can 
also be applied to some BSS problems |[T|. The interpretation 
is that for each source there is some place where among all 
the sources only that source is nonzero. Thus up to scaling, 
the columns of A should appear somewhere in the columns 
of the data matrix X and we can use the same algorithm we 
use for endmember detection. 

II. Convex endmember detection model 

A. Convexification of matrix factorization using the pixel 
purity assumption 

As mentioned above, we are assuming at first that the 
endmembers can be found somewhere in the data X. This 
assumption will enable us to propose a convex model for 
factoring the data X into a product AS, a problem usually 
tackled by non-convex optimization techniques. We assume 
that there exists an index set / such that the columns Xi of 
X are endmembers for i E I. Under the assumption of non- 
negative linear mixing of signals, this means that any column 
Xj in X can be written as 

lei 

for coefficients Tij > 0. The problem is that the coefficients 
Ti j as well as the index set / are unknown. Hence, we start 
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by using all columns in X to describe X itself, i.e. we look 
for coefficients T > for which 



X = XT. 



(2) 



Notice that Equation (j2|i has many solutions. However, we 
know that the desired representation uses as few columns of 
X as possible, i.e., only the endmembers. Since not using the 
j*'' column in the above formulation corresponds to having 
the entire row of T be zero, we can reformulate the 
endmember detection problem as finding a solution to (j2]) such 
that as many rows of T as possible are zero. Mathematically, 



min ||r||i-o„-o such that XT 

T>0 



X, 



(3) 



where ||T||row-o denotes the number of non-zero rows. The 
columns of X that correspond to non-zero rows of the mini- 
mizer T of ([3]) are the desired endmembers (which define the 
lowest dimensional subspace where the data resides). Since 
problem Q is not convex we relax the above formulation 
by replacing ||T||row-o by the convex Zi oo norm ||T||i^oo = 
maxj |Ti jj. The li part of the norm should encourage 
sparsity. Combined with the maximum norm we can expect 
that the vector of maxima among each row becomes sparse 
which means we obtain row sparsity. Notice that we could have 
chosen other row-sparsity encouraging regularizers such as 
^1 2- However, we will prove in Section III the oo relaxation 



is exact in the case of normalized, non-repeating data, which 
makes it clearly preferable. 

As mentioned in the previous section, it is important to 
take noise into account and therefore the equality Q is too 
restrictive for real data. Hence, we will introduce a parameter 
that negotiates between having many zero rows (also called 
row sparsity) and good approximation of the data — Xr|||, 
in the Frobenius norm. Furthermore, for T to be a good 
solution, not only should most of its rows be zero, but the 
nonzero rows should also be sparse, since sparsity of the co- 
efficient matrix reflects physically reasonable prior knowledge. 
For hyperspectral unmixing the additional sparsity requirement 
on T reflects the assumption that most pixels are not mixtures 
of all the selected endmembers, but rather just a few. Thus, 
we will add an additional weighted li term to incorporate 
this second type of sparsity (see lITSl for a model combining 
structured and collaborative sparsity with individual sparsity) 



B. Data reduction as preprocessing 

It is already clear from Equation (j2]i that the problem is 
too large because the unknown matrix T is a d x d matrix, 
where d is the number of pixels. Thus, before proceeding 
with the proposed convex model, we reduce the size of the 
problem by using clustering to choose a subset of candidate 
endmembers Y from the columns of X and a submatrix 



Xs G 



of X for the data with < rf. In other 



words, we want to reformulate the problem as YT w Xs 
with Xs G M^x'is Y G M"X"s T G E"-^'^' with « d, 
ds < d. We use Xg = Y in our experiments but could also 
include more or even all of the data. We use k-means with 
a farthest-first initialization to select Y. An angle constraint 



{Yi,Yj) < .995 ensures the endmember candidates, namely 
the columns of Y, are sufficiently distinct. An upper bound is 
placed on the number of allowable clusters so that the size of 
the problem is reasonable. We then propose a convex model 
for the more manageable problem of finding a nonnegative 
T such that YT w Xg, with T having the same sparsity 
properties described above. Note that we have not convexified 
the problem by pre-fixing the dictionary Y. This is done 
simply to work with manageable dimensions and datasets. Our 
convex model will still select the endmembers as a subset of 
this reduced dataset Y, namely the columns of Y that will 
correspond to non-zero rows of T. 

C. The endmember selection model 

Our model consists of a data fidelity term and two terms 
that encourage the desired sparsity in T. For simplicity, we 
consider the data fidelity term 



MYT- 



) Cw 1 1 F J 



(4) 



\f denotes the Frobenius norm, /3 is a positive 



T>d,xd, 



is a diagonal matrix we can use to 



where 

constant, and G 
weight the columns of (YT—Xs) so that it reflects the density 
of the original data. As mentioned earlier we encourage rows 
of T to be zero by penalizing with CI|2^l|i,oo, which for non- 
negative T equals ( J^i maxj (Tij), with ^ a positive constant, 
so that only a few samples are cooperatively selected as 
endmembers. This kind of collaborative/structured sparsity 
regularizer has been proposed in several previous works, for 
example |H, lfT9ll . 

To encourage sparsity of the nonnegative T, we use a 
weighted li norm, {R^^aCii,,T). Here is a diagonal matrix 
of row weights. We choose to be the identity in our 
experiments, but if for example we wanted to encourage 
selection of endmembers towards the outside of the cone 
containing the data, we could also choose these weights to be 
proportional to {Yj,Y), where Y is the average of the columns 
of Y. This would encourage the method to prefer endmebers 
further away from the average Y. The weighting matrix a has 
the same dimension as T, and the weights are chosen to be 



aij = 1^(1 - e ^ ), 



(5) 



for constants h and i^. This means that atj is small when the 
ith column of Y is similar to the jth column of Xs and larger 
when they are dissimilar This choice of weights encourages 
sparsity of T without impeding the effectiveness of the other 
regularizer Since the smallest weight in each row occurs at 
the most similar data, this helps ensure a large entry in each 
nonzero row of T, which makes the row sparsity term more 
meaningful. It also still allows elements in a column of T to 
be easily shifted between rows corresponding to more similar 
endmember candidates, which can result in a reduction of 
C niaxj (Tj J ) without significantly affecting the weighted 
li term. Since the weighted li penalty here is really just a 

^The work mentioned above by Balzano, Nowak and Bajwa uses block 
orthogonal matching pursuit and also mentions the possible use of || ■ ||2 
instead of 1 1 ■ 1 1 oo ■ 
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linear term, it can't be said to directly enforce sparsity, but by 
encouraging each column of T to sum to something closer to 
one, the weighted li penalty prefers data to be represented by 
nearby endmember candidates when possible, and this often 
results in a sparser matrix T. Overall the proposed convex 
model is given by 

^1 

t 

(6) 

For our experiments we normalize the columns of X to have 
unit I2 norm so that we discriminate based solely on spectral 
signatures and not intensity. 

D. Refinement of solution 

Since we are using a convex model to detect endmem- 
bers, it cannot distinguish between identical or very similar 
endmember candidates, which we will discuss from a more 



theoretical point of view in Section III However, the model 
works very reliably when the columns of Y are sufficiently 
distinct, which they are by construction. A limitation is that 
the convex model is unable to choose as endmembers any data 
not represented in Y. Nonetheless, as is shown in Section IV 
the results of this approach already compare favorably to other 
methods. Moreover, it provides an excellent initialization for 
the alternating minimization approach to NMF, which can be 
used to further refine the solution if desired. Letting A be 
the endmembers selected by the convex model, the solution is 
refined by alternately minimizing 

1 



mm 

A>0,S>0,\\Aj-Aj\\2<aj 



\AS^X\\j, + {R^a,S) (7) 



and renormalizing the columns of A after each iteration. Here, 
aj is the diameter of the jth cluster containing the data near 
Aj, and ensures that the refined endmembers obtained by this 
alternating approach cannot be too different from those already 
selected by the convex model, thereby remaining as close as 
desired to the physical space. 

To recover the full abundance matrix S without refining A, 
we can solve the convex minimization problem in Equation (j?]) 
for S using the full data matrix X e ^mxd original 
endmembers A E selected by the convex model. 

E. Numerical optimization 

We use the alternating direction method of multipliers 
(ADMM) 1201, EH to solve (|6| by finding a saddle point 
of the augmented Lagrangian 

L,-(Z,T,F) -5>o(T) + C Vmax(T,.,) 



/3, 



{YZ — Xs)Cw\\^p 



{P,Z-T) + -\\Z-T\ 



where g>o is an indicator function for the T > constraint. 

In iteration fe + 1 the algorithm proceeds by minimizing first 
Ls{Z, T^, P'^) with respect to Z to get then minimizing 

Ls{T,Z^+'^,P^) with respect to T to get T'^+i, and then 



updating the Lagrange multiplier P by p'^+i = P^+5{Z^^^ — 
T'^^^). Each minimization step is straightforward to compute, 
and the algorithm is guaranteed to converge for any (5 > 0. 

Note that it is faster to precompute and store the inverse 
involved in the update for Z, but this can be overly memory 
intensive if Y has many columns and 7^ I- One could use 
a more explicit variant of ADMM such as the method in ll22ll 
to avoid this difficulty if a larger number of columns of Y 
is desired. Here we have restricted this number to be less 
than 150 and ADMM can be reasonably applied. 

Also note that the minimization problem for T'^^^, 



T 



k+i 



= argm^n5>o(T) +C 



max(Tij) 



P^ 

2" 5 



2 



(8) 



decouples into separate problems for each row T^. The Legen- 
dre transformation of g>Q{Ti) + (ma.Xj{Tij) is the indicator 
function for the set C^; = {F, G R'^^ : || max(F,,0)||i < (}■ 
Let f '^■+1 = Z'^+i + ^ - ^"g'^" . Then by the Moreau 
decomposition ||23| . the minimizer T^+'^ is given by 

7 

where IIc^ (j'fc+i) orthogonally projects each row of T''+^ 
onto Cc, which is something that can be computed with 
complexity 0{ds log(ds)). 

The other algorithm parameters we use are 6 ^ 1, ( = 1, 
/3 = 250, = 50, and /i = 1 - cos(47r/180). In our 
experiments, we also choose Xs = Y. We then define column 
weights Cw that weight each column j by the number of 
pixels in the jth cluster (the cluster centered at Yj) divided 
by the total number of pixels d. To refine the solution of 
the convex model, we note that each alternating step in the 
minimization of (|7]i is a convex minimization problem that 
can again be straightforwardly minimized using ADMM and 
its variants. The update for the abundance S is identical to the 
split Bregman algorithm proposed for hyperspectral demixing 
in lfT4l . ifTSl . and its connection to ADMM is discussed in 

El. 



III. The connection between row-O and /i^oo 
In this section we will show that the motivation for our 
model comes from a close relation between the convex li,ao 
norm and the row-O norm, which counts the number of non- 
zero rows. 

A. Distinct noise-free data 

Let us assume we have data X which is completely noise 
free, normalized, obeys the linear mixing model, and contains 
a pure pixel for each material. Under slight abuse of notation 
let us call this data after removing points that occur more 
than once in the image, X again. As discussed in Section [ll] 
the endmember detection problem can now be reformulated 
as finding the minimizer T of ([3]l, where the true set of 
endmembers can then be recovered as the columns of X that 
correspond to non-zero rows of T. 
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The fact that problem Q gives a solution to the endmember 
selection problem is not surprising, since (j3]l is a non-convex 
problem and related problems (for instance in compressed 
sensing) are usually hard to solve. However, due to the non- 
negativity constraint we will show that ^i.oo minimization is 
an exact relaxation of the above problem. 

For any T > with XT = X the entries of T are less than 
or equal to one, Ti,j < 1 1^ Furthermore, the endmembers can 
only be represented by themselves which means that in each 
row with index i, i e /, we have a coefficient equal to 1. We 
can conclude that the li ^x, norm of any T > with XT = X 
is 



> ^^maxTjj = |/| 



(9) 
(10) 



However, it is possible to have equality in the above estimate 
if and only if Tij ~ for i ^ I. In this case the rows of the 
non-negative li ^o minimizer of XT = X are only supported 
on /, which means it is a minimizer to the row-0 problem ([3]l. 
Vice versa, any row-0 minimizer T has exactly one entry equal 
to one in any row corresponding to an endmember and zero 
rows elsewhere, which means |lr||i.oo — and hence T also 
minimizes the oo norm under the XT = X constraint. We 
therefore have shown the following lemma: 

Lemma III.l. If we remove repeated columns of X and have 
normalized data, the sets of minimizers of 



mm 

T>0 



ITII 



such that XT = X 



and 



min ||T||i^oo such that XT = X 



(11) 



(12) 



are the same. 

Notice that while generally there are other regularizations, 
like for instance Zi 2, which would also encourage row sparsity, 
this lemma only holds for /l oo- For XT — X, the Zi 00 
norm counts the number of rows and is not influenced by 
any value that corresponds to a mixed pixel. This property is 
unique for Zi 00 regularization with a non-negativity constraint 
and therefore makes it the preferable choice in the proposed 
framework. 

B. Noise in the data 

Of course the assumptions above are much too restrictive 
for real data. Therefore, let us look at the case of noisy data 
of the form X^ — X + N, where X is the true, noise-free data 
with no repetition of endmembers as in the previous section, 
and N is noise bounded in the Frobenius norm by || A^Hf < <5. 
We now consider the model 

JciT) = \\XT - X\\l + a|lT|li,oo such that T > 0. (13) 

'This is a simple fact based on the normalization and non-negativity, 1 = 

T.x>0 

W^kW = \\T,iT^,kX^\\ > maxii|T,,feXii| =maxiTi,fc 



The following lemma shows that for the right, noise-dependent 
choice of regularization parameter a, we converge to the 
correct solution as the noise decreases. 

Lemma III.2. Let T be a non-negative \\ ■ Wi o^ -minimum 
norm solution of XT = X, X^ = X + N be noisy data with 
N\\p < S and denote a minimizer of energy functional 



{13} with regularization parameter a and replacing X by X^. 
If a is chosen such that 



a ^ 0, ^ 

a 



as 5 



(14) 



then there exists a convergent subsequence T^^^ and the limit 
of each convergent subsequence is a \\ ■ \\ 1^00 -minimum norm 
solution of XT = X. 

For the sake of clarity we moved the proof of this lemma 



to the appendix. Lemma III. 2 shows that our regularization is 
stable, since for decreasing noise and appropriate choice of the 
parameter a we converge to a non-negative || • ||i 00-minimum 
norm solution of XT — X which - as we know from the 
first part - gives the true solution to the endmember detection 
problem when the columns of X are distinct. While identical 
points are easy to identify and eliminate in the noise free 
data, determining which points belong to the same endmember 
in the noisy data can be very difficult. This is the reason 



for our first data reduction step. Lemma III. 2 tells us that 
for our method to be closely related to the row-0 approach 
we have to avoid having several noisy versions of the same 
endmember in X. We therefore impose an angle constraint 



as described in Section II-B while clustering the data, which 



basically corresponds to an upper bound on the noise and states 
up to which angle signals might correspond to the same point. 

IV. Numerical results for HSI 

In this section we present numerical results on endmem- 
ber detection for supervised and real hyperspectral data and 
compare our results to existing detection algorithms. Since the 
goal of the paper is to present a general new convex framework 
for matrix factorization which is applicable to multiple areas, 
the hyperspectral numerical results are intended mainly as a 
proof of concept. It is therefore encouraging that our method 
is competitive with some of the established methods it is 
compared to in the examples below. 

A. Application to blind hyperspectral unmxing 

1) Supervised endmember detection: For comparison 
purposes we extracted nine endmembers from the 
standard Indian pines dataset (publicly available at httpsj 
//engineering. purdue.edu/biehl/MultiSpec/hyperspectral.html) 
by averaging over the corresponding signals in the ground 
truth region. Then we created 50 data points for each 
endmember, 30 data points for each combination of two 
different endmembers, 10 data points for each combination of 
three different endmembers, and additionally 30 data points as 
mixtures of all endmembers. Finally, we add Gaussian noise 
with zero mean and standard deviation 0.006, make sure our 
data is positive, and normalize it. We evaluate our method 
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in a comparison to N-findr ||2l, vertex component analysis 
(VCA) 05\, |l26l with code from fTT\, an NMF method 
using the ahernating minimization scheme of our refinement 
step with random initial conditions, and the QR algorithm. 
For the latter we simply used MATLAB's QR algorithm 
to calculate a permutation matrix 11 such that XH = QR 
with decreasing diagonal values in R and chose the first 
nine columns of XH as endmembers. Since the success of 
non-convex methods depends on the initial conditions or on 
random projections, we run 15 tests with the same general 
settings and record the average, maximum and minimum 
angle by which the reconstructed endmember vectors deviate 
from the true ones, see Table [I] For the tests we adjusted 
the parameters of our method to obtain 9 endmembers. 



Method 


Evaluation on 15 test runs 




Avg. a 


Min. a 


Max. a 


Ours refined 


3.37 


3.30 


3.42 


Ours without refinement 


3.93 


3.84 


4.01 


VCA 


4.76 


1.78 


6.95 


N-findr 


10.19 


7.12 


13.79 


QR 


9.87 


4.71 


12.74 


Alt. Min. 


4.50 


1.76 


8.17 



TABLE I: Deviation angle from true endmembers 

We can see that our method gives the best average 
performance. Due to a high noise level, methods that rely 
on finding the cone with maximal volume or finding most 
linearly independent vectors, will select outliers as the 
endmembers and do not yield robust results. Looking at the 
minimal and maximal a we see the effect predicted. The 
non-convex methods like alternating minimization and VCA 
can outperform our method on some examples giving angles 
as low as 1.76. However, due to the non-convexity they can 
sometimes find results which are far off the true solution with 
deviation angles of 6.95 or even 8.17 degrees. A big benefit 
of our convex framework is that we consistently produce 
good results. The difference between the best and the worst 
reconstruction angle deviation for our method is 0.15 degrees 
with and 0.17 without the refinement, which underlines its 
high stability. Figure [2] shows the original endmembers as 
well as an example reconstructions by each method with the 
corresponding angle of deviation. 

2) Results on real hyperspectral data: To show how our 
method performs on real hyperspectral data we use the urban 
image (publicly available at www.tec.army.mil/hypercube). 
Figure [3]shows the RGB image of the urban scene, the spectral 
signatures of the endmembers our method extracted, and the 
abundance maps of each endmember, i.e., each row of T 
written back into an image. 

The abundance maps are used in HSI to analyze the dis- 
tribution of materials in the scene. First of all our method 
managed to select six endmembers from the image, which 
is a very reasonable dimension reduction for hyperspectral 
image analysis. We can see that the abundance maps basically 
segment the image into clusters of different material categories 
such as concrete, house roofs, soil or dirt, grass, and two 
different types of vegetation, which all seem to be reasonable 



when visually comparing our results to the RGB image. The 
spectral signatures our method selected are relatively distinct, 
but do not look noisy. Furthermore, the abundance maps seem 
to be sparse which was a goal of our method and reflects the 
physically reasonable assumption that only very few materials 
should be present at each pixel. 

As a next step we compare our results to the ones obtained 
by N-findr, VCA, QR and alternating minimization. Unfortu- 
nately, we have no ground truth for the urban image, which 
is why we will look at the non-negative least squares (NNLS) 
unmixing results based on the endmembers A,n each method 
found. Notice that geometrically NNLS gives the projection 
of the data onto the cone spanned by the endmembers. If 
Sm denotes the NNLS solution of each method, the error 
li^m'5'm — -'^lll^ gives some insight on how much data is 
contained in the cone and therefore, how well the endmembers 
describe the data. However, as discussed earlier, due to noise 
we might not be interested in detecting the pixels furthest 
outside to be endmembers, although they might describe the 
data better. Thus, we will also report the sparsity of each 
cone projection Sm- Since any outside point will be projected 
onto a face or an edge of the cone, the sparsity will give 
some insight into whether the endmember vectors are well 
located. The more the endmembers are towards the inside of 
a point cluster, the more points we expect to be projected 
onto an edge of the cone rather than a face. Thus, a high 
sparsity relates to a reasonable position of an endmember 
Table |ll] shows the relative number of non-zero coefficients, 
i.e., llS'mllo divided by the the total number of entries in 
Sm, as well as the projection error for N-findr, QR, VCA, 
alternating minimization and our method. Figure [4] shows the 
corresponding endmember signals each method found. 





Ours 


N-findr 


VCA 


QR 


Alt. Min. 






533.9 


4185.8 


2516.5 


1857.6 


454.3 




Sm\\o/{d. ■ n) 


0.40 


0.60 


0.47 


0.60 


0.41 



TABLE II: Comparison of different endmember detection 
methods in terms of error and sparsity of the data projection 
onto the cone spanned by the endmembers on the urban image. 
The corresponding endmember signatures are shown in Figure 



We can see from the projection error that the sparse al- 
ternating minimization approach and our method found much 
better representation for the data than N-findr, VCA and QR, 
with the alternating minimization performing slightly better 
than our approach. Furthermore, the sparsity of the projection 
is also higher which indicates more reasonable choices for 
the endmembers. Looking at the spectral signatures in Figure 
|4] we can speculate why this is the case. QR as well as 
N-findr chose very distinct signatures, which are probably 
far outliers in the dataset. Some of the VCA endmembers 
look even more extreme and take negative values, which 
is clearly unphysical and does not allow these endmembers 
to be interpreted as any material. On the other hand the 
endmembers of the alternating minimization approach and of 
our method are very similar and look comparably smooth. 
The average angle of deviation between our method and the 



g 




alternating minimization approach is only 3.4 degrees, which 
lets us conclude that they basically converged to the same 
answer, which is encouraging considering the fact that these 
endmembers describe the rest of the data more than three times 
more accurately than the endmembers found by other methods. 
Furthermore, any signal our method selected differs at most by 
0.036 degrees from an actual point in the data and is therefore 
physically meaningful. 

V. An extended model 

We also propose an extended version of the model that 
takes into account the normalization of the data in order to 
better distinguish between noise and outliers. We show this 
slightly more complicated functional can still be efficiently 
solved using convex optimization. 

A. Error model 

We continue to work with the reduced set of endmember 
candidates Y and a subset of the data Xg, which in practice 
we take to be Y. Instead of penalizing \\{YT — Xs)Cw\\'p, we 
impose the linear constraint YT—Xg = V—Xg diag(e), where 
T,V and e are the unknowns. T has the same interpretation 
as before, V € ^™'xds models the noise, and e G models 
the sparse outliers. Since the columns of Xg are normalized to 
have unit I2 norm, we would also like most of the columns of 
YT to be approximately normalized. In modeling the noise V, 
we therefore restrict it from having large components in the 
direction of Xg. To approximate this with a convex constraint 
V G D, we restrict each column Vj to lie in a hockey puck 
shaped disk Dj. Decompose Vj = + where 
is the orthogonal projection of V, onto the line spanned by 
Xj and is the radial component of Vj perpendicular to 



Vj . Then given < jj- < 1, we restrict ||Vj^||2 < and 
y^l — r| — 1 < T^" < 0. The orthogonal projection onto this 
set is straightforward to compute since it is a box constraint 
in cylindrical coordinates. This constraint set for V} is shown 
in Figure |5] in the case when Cj = 0. 




Fig. 5: Region of possible values for Xj + V, 

We also allow for a few columns of the data to be outliers. 
These are columns of X that we don't expect to be well 
represented as a small error plus a sparse nonnegative linear 
combination of other data, but that we also don't want to 
consider as endmembers. Given some 7 > 0, this sparse error 
is modeled as —Xg diag e with e restricted to the convex 
set i? = {e : e > and ^j[Cwe)j < 7}. Since E is 
the nonnegative region of a weighted li ball, the orthogonal 
projection onto E can be computed with 0{dg\og{dg)) com- 
plexity. Here, since the weights wj sum to one by definition, 
7 can be roughly interpreted as the fraction of data we expect 
to be outliers. For non-outlier data Xj, we want Cj w 
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Fig. 3: Results on the urban hyperpectral image. First row, left: RGB image to compare the fraction planes to. First row, middle 
left: spectral signatures of the endmembers our method found. First row middle right, right and second row: abundance maps 
of the six endmembers. 



and for outlier data we want Cj w 1. In the latter outlier 
case, regularization on the matrix T should encourage the 
corresponding column Tj to be close to zero, so ||yTj||2 is 
encouraged to be small rather than close to one. 

Keeping the oo regularization, the nonnegativity constraint 
and the weighted li penalty from Equation (|6]l, the overall 
extended model is given by 



min C V max(Ti ,) + (i?„CTC„, T) (15) 

T>O.VjeDj.eeE ^ 3 
i 

such that YT - Xs = V - X, diag(e) . 



The structure of this model is similar to the robust PCA model 
proposed in f28l even though it has a different noise model 
and uses /i_oo regularization instead of the nuclear norm. 



B. Numerical optimization 

Since the convex functional for the extended model ( [T5| l is 
slightly more complicated, it is convenient to use a variant 
of ADMM that allows the functional to be split into more 
than two parts. The method proposed by He, Tao and Yuan 
in ll29l is appropriate for this application. Again introduce a 
new variable Z and the constraint Z = T. Also let Pi and 
P2 be Lagrange multipliers for the constraints Z — T = 
and Y Z — V — Xg + Xs diag(e) — respectively. Then the 



augmented Lagrangian is given by 

Ls{Z, T, V, e, Pi, P2) =.9>o(T) + goiV) + gsie) 

+ {Pi,Z-T) 

+ (P2, YZ-V-Xs+X, diag(e)) 



\Z-T\\j, 



\YZ-V-Xs+X, diag(e)|||, 



where and gE are indicator functions for the V <E D and 
e ^ E constraints. 

Using the ADMM-like method in |l29l, a saddle point of 
the augmented Lagrangian can be found by iteratively solving 
the following subproblems with parameters S > and fi > 2, 



arg mm 

z 





1' 






Z~ 




Y 





i-pk 

y'=-X,diag(e'=)+X, 



-pk- 




.2 . 





r'-^+i - argmin.g>o(T) + max(T,,,) 



yk+1 ^ 



SiTgmmgD{V) + ^\\V -V'' 
1(YZ''+' -V'^+X, diag(e'^) - X,) ~ ^ 
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argmin5£;(e) 



2 ^ 

k+1 



(X,),j(P| + 5{YZ''+' -V + X, diag( 



Xs)) 



2 J 



pfc+i ^ pfc ^ _ y/t+i _ X, + X, diag(e'=+^)) 

Each of these subproblems can be efficiently solved. There 
are closed formulas for the Z^'^^ and 1/*^+^ updates, and the 
gfe+i ^jj^ rpk+i upjjafgs both involve orthogonal projections 
that can be efficiently computed. 

C. Effect of extended model 

A helpful example for visualizing the effect of the extended 



model (15 1 is to apply it to an RGB image. Even though 



the low dimensionality makes this significantly different from 
hyperspectral data, it's possible to view a scatter plot of the 
colors and how modifying the model parameters affects the 
selection of endmembers. The NMR data in Section IV-EI is 
four dimensional, so low dimensional data is not inherently 
unreasonable. 

For the following RGB experiments, we use the same 



parameters as described in Section II-E and use the same k- 



means with farthest first initialization strategy to reduce the 
size of the initial matrix Y . We do not however perform the 
alternating minimization refinement step. Due to the different 
algorithm used to solve the extended model, there is an 
additional numerical parameter /i, which for this application 
must be greater than two according to ||29l . We set /i equal to 
2.01. There are also model parameters Vj and 7 for modeling 



the noise and outliers. To model the small scale noise V, 
we set Tj = f] + rrij, where 7/ is fixed at .07 and rrij is 
2 the maximum distance from data in cluster j to the cluster 
center Yj . To model the sparse error e, we use several different 
values of 7, which should roughly correspond to the fraction 
of data we are allowed to ignore as outliers. We will also use 
different values of setting = 50 to encourage a sparser 
abundance matrix and setting ly = to remove the weighted 
li penalty from the model. Figure |6] shows the data, which is a 
color image of multicolored folders, the selected endmembers 
(black dots) from four different experiments, and the sparse 
abundance matrix T for one of the experiments. 

Note that in the ~ experiment, sparsity of T is not 
encouraged, so the dense gray cluster in the center of the cone, 
which corresponds to the floor in the image, is not selected 
as an endmember Additionally, the selected endmembers are 
towards the very outside of the cone of data. In all the 
other experiments where ly = 50 the floor is selected as an 
endmember even though it is in the center of the cone of data. 
This results in a much sparser matrix T. Moreover, the selected 
endmembers tend to be in the center of the clusters of colors 
that we see in the scatter plot of the folder data. Finally we 
note that as we increase the parameter 7, fewer endmembers 
are selected and some of the smaller outlying color clusters 
are ignored. 



D. Comparison between the base and the outlier model 

To illustrate the difference between the outlier and the base 
model we use the same dataset as for the supervised endmem- 
ber detection experiment and first repeat the experiment from 
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original image 




= 0,7 = 



= 50,7 = -005 



7 = .005 



= 50,7 = -01 



1/ = 50,7 = .1 




25 30 35 40 




Fig. 6: Results of the extended model applied to RGB image. Upper left: RGB image we apply the blind unmixing algorithm 
to. Upper middle: 3d plot of the data points in the image in their corresponding color Shown as black dots are the endmembers 
detected without allowing outliers (7 = 0) and without encouraging particular sparsity on the coefficients {y — 0). Upper right: 
With allowing some outliers the method removed an endmember in the one of the outside clusters, but included the middle 
cluster due to the encouraged sparsity. Lower left: Endmember coefficients for the parameter choice = 50, 7 = 0.005, where 
the brightness corresponds to the coefficient value. We can see that the coefficient matrix is sparse. Lower middle: Increasing 
the allowed outliers the red cluster endmember is removed. Increasing the outliers even further leads to decreasing the number 
of endmembers to 4. 



Section IV-Al with 30 data points for each endmember, 20 
data points for each combination of two different endmem- 
bers, 10 data points for each combination of three different 
endmembers, and additionally 30 data points as mixtures of 
all endmembers. We add Gaussian noise with zero mean and 
standard deviation 0.005, run the outlier model with C = 1^ 
rj = 0.08, 7 = 0.01, V = AQ and run the basic model with 
( = 1.3, jy = 40, /3 = 250. Figure |7] shows the results for both 
methods with their average angle of deviation from the true 
endmembers. 

We can see that both models give good results close to 
the ground truth. Due to the Gaussian noise we added, the 
best possible choice of columns of X deviates by an average 
angle of 3.29 degrees from the true endmembers. Due to the 
refinement step both methods could achieve an average angle 
below this value. The main remaining deviation is mainly due 
to the methods selecting an almost straight line rather then the 
corresponding true endmember 

As a second experiment we want to simulate outliers in 
the data. First we again create a mixed pixel data set with 
nine endmembers as above, but without adding Gaussian noise. 
Next, we create a spike signal and add the spike itself as well 
as about 3% data that is mixed with the spike signal. This shall 



simulate a small fraction of the data being outliers. Again, we 
run the basic as well as the outlier model on this dataset and 
obtain the results shown in Figure |8] The upper left image 
shows the true nine endmembers plus the spike signal we used 
to create the outliers with. As we can see in the image on the 
upper right, which shows the result of the basic model, the 
algorithm selected the spike as an endmember. This is very 
reasonable because although only 3% of the data contains parts 
of the spike, it is a strong outlier and hence expensive for the 
fidelity term in the Frobenius norm to exclude. The shown ten 
detected endmembers deviate only by 3.6 degrees from the 
nine true endmembers and the spike. 

The second row of Figure|8]shows the nine true endmembers 
on the left and the result of the outlier model on the right. As 
we can see the outlier model was able to handle the small 
fraction of signals containing the spike and only selected the 
nine true endmembers. The average angle of deviation in this 
case is 2.7 degrees and we can confirm that the model behaves 
like we expected it to. 

E. Application to blind source separation of NMR data 

To illustrate how our model can be applied to BSS problems, 
we use it to recover the four NMR source spectra from (||T| 
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Fig. 7: Comparison between the basic and the outlier method on Indian pines data with Gaussian noise 
nine endmember signals and spike o asr reconstruction of the basic model 
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Fig. 8: Comparison between the basic and the outlier method on Indian pines data with outliers 
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Fig. 4) from four noise free mixtures. The four sources are 
shown in Figure [9] Let Sq S E^xsooo jj^g sources and let 
the mixtures Xo be generated by Xq — AqSq with 



An 



We will use the outlier model to recover the mixing matrix 
A from Xq. Unlike the hyperspectral examples, some columns 
of Xq here can be nearly zero if all sources are simultaneously 
zero at the same spectral index. We can see from Figure 
|9] that this is indeed the case. Since our algorithm uses 
normalized data, we first remove columns of Xq whose norm 
is below some threshold, which we take to be .01 maxj 
We then normalize the remaining columns to get X. This 



simple approach suffices for this example, but in general the 
parameters rj for the V E D constraint could also be modified 
to account for columns of Xq that have significantly different 
norms. 

A minor difficulty in applying our method to this BSS 
problem is that we know A should have four columns but 
there is no way to constrain the algorithm to produce a 
dictionary with exactly four elements. We therefore adjust 
parameters until the dimension of the result is correct. This is 
straightforward to do and could be automated. For example, to 
choose a smaller dictionary, we can reduce v and/or increase 
7- 

The parameters used here are identical to those used in the 
RGB experiments of Section V-C except 7 = .01 and h' ~ 5. 
Also, for the data reduction step, the angle constraint was 



13 



iM 






500 


1000 


15DD 


2000 


2500 


3000 


3500 


400D 


4600 


5000 


- 








1 


















500 


1000 


15DD 


2000 


2500 


3000 


3500 


400D 


4500 


5000 








\l. 
















500 


1000 


15DD 


2000 


2500 


3000 


3500 


400D 


4500 


5000 










■ 




1 













,1 1 ^ 1 1 1 . 1 . ^ . Al 1 1 1 1 

500 1000 15DD 2000 2500 3000 3500 400D 4500 5000 



spectral index 

Fig. 9: Four NMR source spectra from |r| Fig. 4 
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increased to {Yi,Yj) < .998. The computed mixing matrix 
after permutation is 



A 



Note that the columns of A are normalized because the 
algorithm selects dictionary elements from a normalized ver- 
sion of the data X. Since for this simple problem, A is 
invertible, it is straightforward to recover the sources by 
S — max{0, A^^ Xq). More generally, we can recover S by 
minimizing the convex functional in Equation (j7]i with respect 
to S using the un-normalized data matrix Xq and the computed 
endmembers A. 

VI. Future research 

We have presented a convex method for factoring a data 
matrix X into a product AS with 5 > under the constraint 
that the columns of A appear somewhere in the data X. This 
type of factorization ensures the physical meaning of the dic- 
tionary A, and we have successfully applied it to hyperspectral 
endmember detection and blind source separation in NMR. 
For non-repeating noise free data, the Zi.oo regularization was 
proven to be an exact relaxation of the row-0 norm. We further 
proposed an extended model that can better handle outliers. 
Possible future application areas include computational biol- 
ogy, sensor networks, and in general dimensionality reduction 
and compact representation applications where the physical 
interpretation of the reduced space is critical. It will also be 
interesting to try and extend our convex model to the class 
of problems discussed in f30l for which the pixel purity or 
non-overlapping assumption is approximately but not exactly 
satisfied. 



Acknowledgments - We would like to acknowledge Laura 
Balzano and Rob Nowak, who are working on a similar 
selection framework without non-negativity constraints, par- 
tially motivated by problems in sensor networks, and had 
provided important comments and feedback. We would also 
like to thank Nicolas Gillis, John Greer and Todd Wittman for 
helpful discussions about endmember detection strategies, and 
Yuanchang Sun for his advice on BSS problems. 

Appendix 
Proof of Lemma [11172] 
Proof: Since is a minimizer of we can conclude 
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N\ 



=0 



IIT-Idl 



(16) 



Since 77 — > 0, is bounded in the \\ ■ Wi.os norm and 
therefore has a convergent subsequence. Let T^^^ denote such 
a convergent subsequence and let T be its limit. Because 
T*" > we also have T > 0. We can now use the above 
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estimate for showing 



ITI 



l.OO 



= limliTi 



< 



Furthermore, we have 



\XT-X\\l 



lim [||r||i.oo - 

n 

ll^lll.oo. 

limiix'^-ri" 



\T- 



W|||] 



(17) 



< lim J„„(T^") 

n 

< lim Ja„ (f ) 

n 

< limllX-^-f -X-^" 

n 

< lim<52||f 

n 

= 0. 



p + Q!„|lT||i,oo 
arJ|T|ll,oo 



(18) 



Now, by the above estimate (18i we know that XT = X. 
Furthermore, by the estimate ( 17i and taking into account that 
T was a non-negative || • ||i oo-minimum norm solution of 
XT ~ X, we have shown that the limit of our convergent 
subsequence is also a non-negative || • || l oo -minimum norm 
solution of XT ^X. ■ 
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